Diamonds Data

Look at Variables and Define New Ones

##   carat clarity color       cut price
## 1  0.51     SI2     I Very Good   774
## 2  0.93      IF     H     Ideal  6246
## 3  0.50    VVS2     D Very Good  1146
## 4  0.30     VS1     F     Ideal   538
## 5  0.31     SI1     F     Ideal   502
## 6  1.00     VS1     F     Ideal  7046
##   carat clarity color       cut price clarity2  carat2
## 1  0.51     SI2     I Very Good   774       SI      <1
## 2  0.93      IF     H     Ideal  6246       IF      <1
## 3  0.50    VVS2     D Very Good  1146      VVS      <1
## 4  0.30     VS1     F     Ideal   538       VS      <1
## 5  0.31     SI1     F     Ideal   502       SI      <1
## 6  1.00     VS1     F     Ideal  7046       VS 1 - 1.9

Description of Variables:

  • carat: Weight of the diamond in carats (0.23 - 7.09)
  • clarity: Measurement of how clear the diamond is
    • I1 : Included Diamonds with obvious inclusions that impact beauty
    • SI2: Slightly Included Diamonds with inclusions detectable to keen unaided eye, especially when viewed from the side
    • SI1: Slightly Included Diamonds with inclusions noticable at 10x magnification (best value)
    • VS2: Very Slightly Included Diamonds with minor inclusions that are somewhat easy to see at 10x magnification
    • VS1: Very Slightly Included Diamonds with minor inclusions that are difficult to see
    • VVS2: Very Very Slightly Included Diamonds with minuscule inclusions that are difficult even for trained eyes to see under 10x magnification
    • VVS1: Very Very Slightly Included Diamonds with minuscule inclusions that are difficult even for trained eyes to see under 10x magnification
    • IF: Internally Flawless Diamonds with no inclusions within the stone, only surface characteristics set the grade
    • FL: Flawless Diamonds with no internal or external characteristics (rare)
  • color: Measurement of faint diamond color
    • D: Rarest and highest quality with a pure icy look
    • E: Rarest and highest quality with a pure icy look
    • F: Rarest and highest quality with a pure icy look
    • G: No discernible color; great value for the quality
    • H: No discernible color; great value for the quality
    • I: No discernible color; great value for the quality
    • J: No discernible color; great value for the quality
  • cut: Cut quality of Diamond
    • Good: This cut represents roughly the top 25% of diamond cut quality. It reflects most of the light that enters, but not as much as a Very Good cut grade.
    • Very Good: This cut represents roughly the top 15% of diamond cut quality. It reflects nearly as much light as the ideal cut, but for a lower price.
    • Ideal: This rare cut represents roughly the top 3% of diamond cut quality. It reflects most of the light that enters the diamond.
    • Astor Ideal: These diamonds are crafted to gather and reflect the most light possible. Cut from the finest raw material (rough stones with as few impurities or inclusions as possible), they meet rigorous quality requirements and exhibit outstanding brilliance, fire, and scintillation. In addition to being graded by the GIA, all Astor by Blue Nile™ diamonds are certified by GemEx®.
  • price: Price is U.S. Dollars ($322 - $355403)

Create Visualizations For Bule Nile Claims

Claim 1: the cut of the diamond is biggest factor in the price

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

Although many diamonds of “Very Good” and “Ideal” cuts are sold for much higher prices than any “Good” diamond is sold for, on average they are less expensive based only on the subset of data we are evaluating. Additionally, all cut qualities seem to have very similar price per carat relationships indicating that they are not a powerful factor in price determination.

Claim 2: the absence of color makes a diamond more expensive

## `geom_smooth()` using formula = 'y ~ x'

The color D means the diamond is pure, icy, and lacking color. As the color values sequence up through the alphabet, the amount of color present increases. It is clear from the plot above that the price per carat is higher for diamonds with less color than those with more. This affirms Blue Niles claim that the absence of color makes a diamond more expensive.

The spike in average price for diamonds the color H may be patrially attributed to the fact that the H colored diamonds in this data set were, on average, larger carat weights than the other colored diamonds.

Claim 3: There is a pattern of “buying shy” in terms of carat weight when purchasing diamonds

filtered_data <- diamond %>%
filter(price <= 9000, carat <= 5)

                     
carat_categories <- cut(filtered_data$carat, breaks = seq(0, max(filtered_data$carat), by = 0.25))
price_categories <- cut(filtered_data$price, breaks = 6)

heatmap_data <- data.frame(
  carat = carat_categories,
  price = price_categories
)

# Count the occurrences of each combination of carat and price
heatmap_counts <- table(heatmap_data)

# Convert the table to a data frame for plotting
heatmap_df <- as.data.frame(heatmap_counts)

# Rename the columns for better readability
names(heatmap_df) <- c("Carat", "Price", "Quantity")

# Create the heatmap using ggplot2
heatmap_plot <- ggplot(heatmap_df, aes(x = Carat, y = Price, fill = Quantity)) +
  geom_tile() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_gradient(low = "white", high = "red") +
  labs(
    title = "Heatmap of Quantity by Carat and Price",
    x = "Carat Weight",
    y = "Price"
  ) +
  theme_minimal()

# Display the heatmap
heatmap_plot


Linear Model Fitting

ggplot(diamond, aes(x=carat,y=price)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
  labs(title="Scatterplot of Price Against Carat", x="Carat", y="Price")
## `geom_smooth()` using formula = 'y ~ x'

Here is the starting data. It seems like our errors do not have mean zero (there are more points above the line, then below, then above) and have increasing variance (points have greater vertical spread as carat increases)

result<-lm(price~carat, data=diamond)
par(mfrow = c(2, 2))
plot(result)

We can see more clearly now that several assumptions are violated. We can see from the residual plot that the residuals do not have mean zero (the line is not parallel to the x-axis) and do not have constant variance (we do not have an even vertical spread of data points as we move from left to right). Also, from the QQ plot we can see that our data are not normally distributed. Finally, from the residuals vs. leverage plot we see that there are several points which have a Cooks distance greater than 1, and should thus be treated as influential outliers. In conclusion, we need to transform both y and x.

result <- lm(price~carat, data=diamond)
library(MASS)
MASS::boxcox(result, lambda = seq(0.2, 0.4, 1/10))

From the Box-Cox plot, we will try the transformation \(y^* = y^{0.3}\)

ystar <- diamond$price^0.3
diamond_transformed<-data.frame(diamond$carat,ystar)

ggplot(diamond_transformed, aes(x=diamond.carat,y=ystar)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
  labs(title="Scatterplot of Price (Transformed) Against Carat", x="Carat", y="Price ^ 0.3")
## `geom_smooth()` using formula = 'y ~ x'

result<-lm(ystar~diamond.carat, data=diamond_transformed)
par(mfrow = c(2, 2))
plot(result)

After transforming the response variable we can see that assumptions 1 and 2 are still not met. We can see from the residual plot that the residuals do not have mean zero (the line is not parallel to the x-axis) and do not have constant variance (we do not have an even vertical spread of data points as we move from left to right).

Let’s see if we can better fix this increasing variance. We know from the Box-Cox plot that \(\lambda < 1\), so let’s try a log transformation.

ystar <- log(diamond$price)
diamond_transformed<-data.frame(diamond$carat,ystar)

ggplot(diamond_transformed, aes(x=diamond.carat,y=ystar)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
  labs(title="Scatterplot of Price (Transformed) Against Carat", x="Carat", y="log(Price)")
## `geom_smooth()` using formula = 'y ~ x'

result<-lm(ystar~diamond.carat, data=diamond_transformed)
par(mfrow = c(2, 2))
plot(result)

These seems to have better fixed the increasing variance compared to the first transformation, so lets stick with \(y^* = log(y)\). We now need to transform x as well.

Based on the scatterplot, it seems like another log transformation for x will be the best option. It seems to match our data well and will allow us to best interpret our results.

xstar<-log(diamond$carat)
diamond_final<-data.frame(xstar,ystar)

ggplot(diamond_final, aes(x=xstar,y=ystar)) + geom_point() + geom_smooth(method = "lm", se=FALSE) +
  labs(title="Scatterplot of Price (Transformed) Against Carat (Transformed)", x="log(Carat)", 
       y="log(Price)")
## `geom_smooth()` using formula = 'y ~ x'

result<-lm(ystar~xstar, data=diamond_final)
par(mfrow = c(2, 2))
plot(result)

It seems like we have finally satisfied the assumptions. Looking at the residual plot, we see that the line indicating the average values of the residuals is parallel to the x-axis, so assumption 1 is satisfied. We also see a more constant vertical variation as x increases, so assumption 2 is satisfied as well.

summary(result)
## 
## Call:
## lm(formula = ystar ~ xstar, data = diamond_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96394 -0.17231 -0.00252  0.14742  1.14095 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.521208   0.009734   875.4   <2e-16 ***
## xstar       1.944020   0.012166   159.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2761 on 1212 degrees of freedom
## Multiple R-squared:  0.9547, Adjusted R-squared:  0.9546 
## F-statistic: 2.553e+04 on 1 and 1212 DF,  p-value: < 2.2e-16

Our regression equation is \(\hat{y}^* = 8.521 + 1.944x^*\), where \(y^* = log(y)\) and \(x^* = log(x)\). We also have \(R^2 = 0.9547\). We can interpret this coefficient as follows: For an \(a\%\) increase in carat, the price is multiplied by approximately \((1+\frac{a}{100})^{1.9}\). This means a \(44\%\) increase in carat corresponds to a doubling in price.


Additional Visuals

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine